FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure.
The research below aim to help FishEye develop a new visual analytics approach to better understand fishing business anomalies.
We will use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.
Task 1: Use visual analytics to identify anomalies in the business groups present in the knowledge graph.
Task 2: Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user.
1. Data Pre-processing and cleaning
Load the library and read the json relationship file MC2.
jsonlite: A lightweight R package for working with JSON data, providing functions to convert JSON to R objects and vice versa.
tidygraph: A tidyverse package that provides a tidy and consistent approach to working with graph data structures, allowing for easy manipulation, visualization, and analysis of networks.
ggraph: An extension of the ggplot2 package that specializes in creating aesthetically pleasing and customizable visualizations of graphs and networks.
visNetwork: An R package that utilizes the vis.js library to create interactive network visualizations, allowing for exploration and analysis of complex networks.
tidyverse: A collection of R packages, including ggplot2, dplyr, tidyr, and others, designed to provide a cohesive and consistent framework for data manipulation, visualization, and analysis.
shiny: An R package for building interactive web applications directly from R code, enabling the creation of user-friendly and responsive data-driven applications.
plotly: An R package that provides a high-level interface for creating interactive and dynamic visualizations, allowing users to explore and analyze data through features like hover effects, zooming, and panning.
graphlayouts: An R package that offers various algorithms for laying out and visualizing graph structures, providing options for arranging nodes and edges in a visually meaningful way.
ggforce: An extension package for ggplot2 that extends its capabilities by introducing new geoms, statistical transformations, and scales, enabling users to create more advanced and specialized plots.
tidytext: A tidyverse package that provides tools for text mining and analysis, allowing users to manipulate, explore, and visualize text data using the principles of tidy data.
skimr: An R package that provides concise and informative summaries of data frames, providing a quick overview of variables’ distributions, missing values, and other summary statistics.
Show the code
#echo | false#tidytext -- text mining library with R: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html#Load Libraries pacman::p_load(jsonlite,tidygraph, ggraph, visNetwork, tidyverse, shiny, plotly, graphlayouts, ggforce, tidytext,skimr) #load Data MC3<-fromJSON("data/MC3.json")
Data Cleaning for MC3 Nodes and Edges
We picked the desired fields and reorganized the columns using select function. The nodes in MC3 will be companies or person, and description about companies, with their product and services, country and revenue generated.
As we load the data, we found this diagram is not directed, so we will not know the in/out direction of connection.
Below code extract out nodes out for further processing.
country id product_services revenue_omu
0 0 0 0
type
0
Show the code
#Extract and mutate the format so it's not list but dataframe MC3_nodes_clean <- MC3_nodes %>%mutate(country =as.character(country),id =as.character(id),product_services =as.character(product_services),revenue_omu =as.numeric(as.character(revenue_omu)), #we need to convert to numeric directlytype =as.character(type)) %>%select(id, country, type, revenue_omu, product_services)
The original data do not have NA value, however by transforming data into table format, some fields are NA.
Show the code
#check data quality, find missing valuecolSums(is.na(MC3_nodes_clean))
id country type revenue_omu
0 0 0 21515
product_services
0
Show the code
#check which are the types?unique(MC3_nodes_clean$type)
Out of the total Nodes 21515 out of 27622 rows do not have value for revenue_omu. The ratio of missing value in revenue_omu is 77.9%. We will need to deal with this Missing values. And there are 22929 out of 27622 rows have unique ids, there are duplicates with id. The ratio of non-duplicate id is 83.0%.
Remove duplicates in nodes: If two rows with duplicate id but with different value in any other 4 columns (country, type, revenue_omu, product_services), keep both rows the duplicate id. If the two rows are identical for all columns, we remove the duplicate row.
Show the code
#check which are the duplicate ids duplicate_ids <- MC3_nodes_clean[duplicated(MC3_nodes_clean$id), "id"]#use R base function duplicate to achieve this MC3_nodes_clean <- MC3_nodes_clean[!duplicated(MC3_nodes_clean), ] DT::datatable(MC3_nodes_clean)
After removing duplicates, around 2000 rows has been removed, out of the total 4693 duplicate ids.
Below code extract out edges out for further processing.
There is no missing value in edges data. Explore the dataset.
Show the code
#check which are the types?unique(MC3_edges$type)
[1] "Company Contacts" "Beneficial Owner"
Show the code
MC3_edges_clean <- MC3_edges %>%mutate(source =as.character(source),target =as.character(target),edgeType =as.character(type)) %>%group_by(source, target, edgeType) %>%summarise(weights =n()) %>%filter(source!=target) %>%ungroup()#datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document. DT::datatable(MC3_edges_clean)
What are the types for edge and nodes?
From the exploration above, we know there are 3 node type possible: Company, Company Contacts, Beneficial Owner, and there are 2 edge type possible: Company Contacts, Beneficial Owner
Below plot shows the proportion of each type in node and edge respectively.
In order to find the business group, we will check the type of different category of data. There might be owner - business, customer - business, business - business relationship
Assumption: in this case, we will assume Node type = Company, indicating the node is a legal entity, while node type = Company Contacts, Beneficial Owner, the node is a natural person
Check source and target types mapping
From edge file, first we explore if there’s any mapping information in node for each of the source and target item
Show the code
# Count the number of targets in MC3_edges_clean that exist in MC3_nodes_clean existing_targets_count <- MC3_edges_clean %>%mutate(target =as.character(target)) %>%semi_join(select(MC3_nodes_clean, id), by =c("target"="id")) %>%summarise(targets_found =n_distinct(target))print(existing_targets_count)
# A tibble: 1 × 1
targets_found
<int>
1 0
Show the code
# Count the number of sources in MC3_edges_clean that exist in MC3_nodes_clean existing_sources_count <- MC3_edges_clean %>%mutate(source =as.character(source)) %>%semi_join(select(MC3_nodes_clean, id), by =c("source"="id")) %>%summarise(sources_found =n_distinct(source))print(existing_sources_count)
# A tibble: 1 × 1
sources_found
<int>
1 4880
There were sources id found in node file but no target id was found at all. Looking into the edge data, the targets seems are all person’s name, hence the in the edge table, the edgeType should be the target type.
Show the code
# Consolidate node information and add source node types, target node type will not be found from node join. Edge type is treated as the target type MC3_edges_clean_Join <- MC3_edges_clean %>%left_join(select(MC3_nodes_clean, #%>% filter(type == "Company"), id, sourceNodeType = type), by =c("source"="id")) %>%group_by(source, target, sourceNodeType) %>%filter(source != target) %>%distinct() %>%ungroup()# Plot the stacked bar chartggplot(data = MC3_edges_clean_Join, aes(x = sourceNodeType, fill = edgeType)) +geom_bar() +labs(title ="Source Node Types with Breakdown of Target edgeType",x ="Source Types",y ="Count") +scale_fill_discrete(name ="Target Type")+coord_flip()
They are some observations of source type from benefit owner/company contacts to target type of benefit owner/company contacts. By right all sources should be company.As the number of none company source are lesser, consider exclude those are company contacts/beneficial owner type.
For source type of company, we may obtain additional company information mainly from Node (Revenue, Product/Services), and people related information can be obtained from Edge (Edge Type). We can use this to derive a new nodes data frame from edges data frame.
2.Focus on Fish related companies
2.1 Text sensing with tidytext
word count
Since we are interested to understand fishing business anomalies, do a simple word count to understand how many companies has fishing related activities, by simply counting occurrence of “fish” from product services field.
Show the code
#start a bit of text sensing, display the result by max value first MC3_nodes1 <- MC3_nodes_clean %>%mutate(n_fish =str_count(product_services, "fish")) %>%arrange(desc(n_fish))library(ggplot2)ggplot(data = MC3_nodes1, aes(x = n_fish)) +geom_histogram(bins=20, boundary =20,color="grey25", fill="grey90") +ggtitle("Distribution of keyword Fish frequency")
From the distribution we can see not all companies are fishing related. Hence we could filter out some companies.
Tokenisation
We then do tokenization of the Product_Services In text sensing, tokenisation is the process of breaking up a given text into units called tokens. We will discard characters like punctuation marks in this progress.
The two basic arguments to unnest_tokens() used here are column names. First argument is the output column name that will be created as the text is unnested into it, and then the input column that the text comes from (product_services, in this case).
Below code visualize the distribution of tokenized words for prodct_service
Show the code
nodesToken %>%count(word, sort =TRUE) %>%top_n(30) %>%mutate(word =reorder(word, n)) %>%ggplot(aes(x = word, y = n)) +geom_col() +xlab(NULL) +coord_flip() +labs(x ="Count",y ="Unique words",title ="Count of unique words found in product_services field")
With above we saw the top frequently sensed word may not be useful, not in our interest. For example, NA, “a” and “to”. We will need to remove these words as stop words.
2.2 Remove stopwords and custom words and Filter data
From the token generated, we need to take out the common/generic words and we will also exclude NA records, and there are some words obviously doesn’t relate to fish activity such as machines, transportations etc. We can remove them as well.
Visualization with bar chart after remove stopword
With this visualization, we can detect none fishing related high frequency token and remove it by updating condition in last step. By several rounds of refinement, below is the diagram that have mostly fishing related keywords:
Show the code
tidy_stopwords %>%count(word, sort =TRUE) %>%top_n(30) %>%mutate(word =reorder(word, n)) %>%ggplot(aes(x = word, y = n)) +geom_col() +xlab(NULL) +coord_flip() +labs(x ="Count",y ="Unique words",title ="Count of unique words found in product_services field")
2.3 Prepare list of fishing related nodes and edges
We found the Maritime/fishing related keyword above, so we can use this knowledge to filter original edges and nodes to look at fishing related only.
Show the code
# Define your desired special words desired_words <-c("fish", "seafood", "frozen", "fresh", "salmon", "canned", "meat", "tuna", "shrimp", "shellfish", "sea", "squid", "seafoods", "marine", "fillets", "foods")# Filter the companies that contain the desired special words filtered_companies <-subset(nodesToken, grepl(paste(desired_words, collapse ="|"), id, ignore.case =TRUE)) MC3_Node_fish <- MC3_nodes_clean %>%filter(type =="Company"&grepl("fish|seafood|frozen|fresh|salmon|canned|tuna|shrimp|shellfish|sea|squid|fishing|seafoods|marine|fillets", product_services, ignore.case =TRUE)) MC3_edges_fishing <- MC3_edges_clean_Join %>%filter(source %in% MC3_Node_fish$id) %>%distinct()
3. Derive New Node Data, Building network model with tidygraph
There are rows with only id but no type information, for the “type” column in MC3_nodes_fishNetwork that is empty, we can potentially fill them with corresponding values from MC3_edges_clean because it provide target node types.
Show the code
#extract all fish activy related source id5 <- MC3_edges_fishing %>%select(source, type = sourceNodeType) %>%rename(id = source)#extract all fish activity related target id6 <- MC3_edges_fishing %>%select(target, type = edgeType) %>%rename(id = target) MC3_nodes_fishNetwork2 <-rbind(id5, id6) %>%distinct() %>%left_join(MC3_nodes_clean,unmatched ="drop") mc3_graphFish2 <-tbl_graph(nodes = MC3_nodes_fishNetwork2,edges = MC3_edges_fishing,directed =FALSE) %>%mutate(betweenness_centrality =centrality_betweenness(),closeness_centrality =centrality_closeness()) mc3_graphFish2 %>%filter(betweenness_centrality >=2) %>%ggraph(layout ="as_star") +geom_edge_link(aes(alpha=0.5)) +geom_node_point(aes(size = betweenness_centrality,colour = type,alpha =0.5)) +scale_size_continuous(range=c(1,10))+theme_graph()
It seems with higher betweenness, there’s higher revenue in the transactions
4.Consolidate counting information
With the above knowledge graph, we are interested to know 1. companies vs. owner count, for each company, how many owner does it have? 2. companies vs. company contacts, for each company, how many contacts does it have? 3. owners vs. companies, which are the owners that owns multiple companies?
After that we could categorize relationship manually, give them some labels
Show the code
# Counting Beneficial Owner and Company Contacts for each company company_counts <- MC3_edges_clean %>%filter(edgeType %in%c("Beneficial Owner", "Company Contacts")) %>%group_by(source, edgeType) %>%summarise(count =n()) %>%pivot_wider(names_from = edgeType, values_from = count, values_fill =0)# Counting companies owned by each Beneficial Owner owner_counts <- MC3_edges_clean %>%filter(edgeType =="Beneficial Owner") %>%group_by(target) %>%summarise(numOfCompanyOwned =n_distinct(source))# Update the nodes with the count information MC3_nodes_updated <- MC3_nodes_clean %>%left_join(company_counts, by =c("id"="source")) %>%left_join(owner_counts, by =c("id"="target"))
Generate some counts with the records based on relationship observed.
Show the code
# Counting Beneficial Owner and Company Contacts for each company company_counts1 <- MC3_edges_fishing %>%filter(edgeType %in%c("Beneficial Owner", "Company Contacts")) %>%group_by(source, edgeType) %>%summarise(count =n()) %>%pivot_wider(names_from = edgeType, values_from = count, values_fill =0)%>%rename(numOfBenOwner ="Beneficial Owner", numOfComContact ="Company Contacts")# Counting companies owned by each Beneficial Owner owner_counts1 <- MC3_edges_fishing %>%filter(edgeType =="Beneficial Owner") %>%group_by(target) %>%summarise(numOfCompanyOwned =n_distinct(source))# Update the nodes with the count information, and take out Undesired companies MC3_nodes_fishupdated <- MC3_nodes_fishNetwork2 %>%left_join(company_counts1, by =c("id"="source")) %>%left_join(owner_counts1, by =c("id"="target")) %>%distinct(id, .keep_all =TRUE) %>%filter(!(type =="Company"& product_services %in%c("Unknown", "character(0)")))# # visNetwork(MC3_nodes_fishupdated_filtered,# MC3_edges_fishing)
4.1 Understanding the company to owner, company to contact, owner to company relationship with the count distribution
Show the code
library(patchwork)# Calculate average values avg_ben_owner <-mean(MC3_nodes_fishupdated$numOfBenOwner, na.rm =TRUE) avg_com_contact <-mean(MC3_nodes_fishupdated$numOfComContact, na.rm =TRUE) avg_company_owned <-mean(MC3_nodes_fishupdated$numOfCompanyOwned, na.rm =TRUE)# Create separate histogram plots hist1 <-ggplot(MC3_nodes_fishupdated) +geom_histogram(aes(x = numOfBenOwner), fill ="skyblue", color ="black", bins =20) +geom_vline(xintercept = avg_ben_owner, color ="red", linetype ="dashed", size =1) +labs(title ="Distribution of number of Beneficial Owners of each company",x ="Number of Beneficial Owners",y ="Frequency",caption =paste("Average owners count:", avg_ben_owner)) +theme(plot.caption =element_text(hjust =0)) hist2 <-ggplot(MC3_nodes_fishupdated) +geom_histogram(aes(x = numOfComContact), fill ="lightgreen", color ="black", bins =20) +geom_vline(xintercept = avg_com_contact, color ="red", linetype ="dashed", size =1) +labs(title ="Distribution of number of Company Contacts of each company",x ="Number of Company Contacts",y ="Frequency",caption =paste("Average company contract count:", avg_com_contact)) +theme(plot.caption =element_text(hjust =0)) hist3 <-ggplot(MC3_nodes_fishupdated) +geom_histogram(aes(x = numOfCompanyOwned), fill ="lightpink", color ="black", bins =20) +geom_vline(xintercept = avg_company_owned, color ="red", linetype ="dashed", size =1) +labs(title ="Distribution of Companies Owned by Beneficial Owner",x ="Number of Companies Owned",y ="Frequency",caption =paste("Average companies owned count:", avg_company_owned)) +theme(plot.caption =element_text(hjust =0))# Arrange the plots vertically hist_combined <- hist1 / hist2 / hist3 +plot_layout(nrow =3) hist_combined
4.2 Potential anomalities labeling
Next, manually add some label for anomalies with the knowedge from the distribution above.
Show the code
# Detect outliers in numOfBenOwner outliers_ben_owner <-sort(boxplot.stats(MC3_nodes_fishupdated$numOfBenOwner)$out)# Detect outliers in numOfComContact outliers_com_contac <-sort(boxplot.stats(MC3_nodes_fishupdated$numOfComContact)$out)# Detect outliers in numOfCompanyOwned outliers_company_owned <-sort(boxplot.stats(MC3_nodes_fishupdated$numOfCompanyOwned)$out)# Print the outlier valuescat("Outliers in numOfBenOwner:", outliers_ben_owner, "\n")
MC3_nodes_fishupdated <- MC3_nodes_fishupdated %>%mutate(label =case_when( numOfBenOwner >=8~"Too Many Owners", numOfComContact ==0~"No Company Contacts", numOfComContact >=5~"Many Company Contacts", numOfCompanyOwned >=2~"Own more than 1 company",TRUE~"Normal" ))
Next we split the diagram to different view and investigate different abnormalities
First understand a rough spread among different type of nodes with different labels identified
Show the code
library(ggrepel) count_table <- MC3_nodes_fishupdated %>%group_by(type, label) %>%summarise(count =n()) %>%ungroup()# Plot the stacked bar chartggplot(data = MC3_nodes_fishupdated, aes(x = type, fill = label)) +geom_bar() +# geom_text(data = count_table, aes(label = count), vjust = -0.5, color = "black") +labs(title ="Source Node Types with Breakdown of different business pattern label",x ="Source Types",y ="Count") +scale_fill_discrete(name ="business pattern label")+coord_flip()
With the proportion displayed above, we will zoom into look at Too many Owners, Own more than 1 company, No Company Contacts for knowledge graph.
4.3 ompanies with Too Many Owners
First is about those companies with Too Many Owners (there are at least more than 8 of them)
Show the code
MC3_nodes_toomanyowners <- MC3_nodes_fishupdated%>%filter(label =="Too Many Owners") MC3_edges_toomanyowners <- MC3_edges_fishing %>%filter(source %in% MC3_nodes_toomanyowners$id)%>%rename(from = source)%>%rename(to = target) idS <- MC3_edges_toomanyowners %>%select(from) %>%rename(id = from) idT <- MC3_edges_toomanyowners %>%select(to) %>%rename(id = to) MC3_nodes_toomanyownersView <-rbind(idS, idT) %>%left_join(MC3_nodes_fishupdated, by =c("id"="id"))%>%distinct() %>%#define the type as different group for colorrename (group = type) visGraph <-visNetwork(MC3_nodes_toomanyownersView,MC3_edges_toomanyowners, width ="100%")%>%visIgraphLayout(layout ="layout_with_fr") %>%visNodes(id ="id", label ="numOfBenOwner") %>%visEdges(arrows ='to') %>%visOptions(selectedBy ="group",highlightNearest =list(enabled =TRUE,degree =1,hover =TRUE,labelOnly =TRUE),nodesIdSelection =TRUE ) %>%visInteraction(navigationButtons =TRUE)%>%visLegend() %>%visLayout(randomSeed =123) visGraph
4.4 Owners with more than one company
Show the code
group_colors <-c("Beneficial Owner"="lightpink1","Company"="cadetblue3","Company Contacts"="grey") MC3_nodes_abnormalOwner <- MC3_nodes_fishupdated%>%filter(label =="Own more than 1 company") MC3_edges_abnormalOwner <- MC3_edges_fishing %>%filter(target %in% MC3_nodes_abnormalOwner$id)%>%rename(from = source)%>%rename(to = target) idS1 <- MC3_edges_abnormalOwner %>%select(from) %>%rename(id = from) idT1 <- MC3_edges_abnormalOwner %>%select(to) %>%rename(id = to) MC3_nodes_abnormalOwnerView <-rbind(idS1, idT1) %>%left_join(MC3_nodes_fishupdated, by =c("id"="id"))%>%rename (group = type) %>%#define the type as different group for colordistinct() visGraph <-visNetwork(MC3_nodes_abnormalOwnerView,MC3_edges_abnormalOwner, width ="100%")%>%visIgraphLayout(layout ="layout_with_fr") %>%visNodes(color = group_colors) %>%# visEdges(arrows = 'to') %>%visOptions(nodesIdSelection =TRUE,selectedBy ="group",highlightNearest =list(enabled =TRUE,degree =1,hover =TRUE,labelOnly =TRUE) ) %>%visInteraction(navigationButtons =TRUE)%>%visLegend() %>%visLayout(randomSeed =123) visGraph
4.5 Companies with no contacts at all
Show the code
MC3_nodes_NoCompanyContacts <- MC3_nodes_fishupdated%>%filter(label =="No Company Contacts") MC3_edges_NoCompanyContacts <- MC3_edges_fishing %>%filter(source %in% MC3_nodes_NoCompanyContacts$id)%>%rename(from = source)%>%rename(to = target) idS2 <- MC3_edges_NoCompanyContacts %>%select(from) %>%rename(id = from) idT2 <- MC3_edges_NoCompanyContacts %>%select(to) %>%rename(id = to) MC3_nodes_NoCompanyContactsView <-rbind(idS2, idT2) %>%left_join(MC3_nodes_fishupdated, by =c("id"="id"))%>%rename (group = type) %>%#define the type as different group for colordistinct() visGraph <-visNetwork(MC3_nodes_NoCompanyContactsView,MC3_edges_NoCompanyContacts, width ="100%")%>%visIgraphLayout(layout ="layout_with_fr") %>%# visNodes(color = group_colors) %>%visEdges(arrows ='to') %>%visOptions(nodesIdSelection =TRUE,selectedBy ="group",highlightNearest =list(enabled =TRUE,degree =1,hover =TRUE,labelOnly =TRUE) ) %>%visInteraction(navigationButtons =TRUE)%>%visLegend() %>%visLayout(randomSeed =123) visGraph